---
# jupyter: python3
title: "Hands On: Introduction to Large Language Models (LLMs)"
description: "An introduction to Large Language Models (LLMs), their architecture, and how they can be applied to scientific applications."
date: 2025-07-23
date-modified: last-modified
---
<a href="https://colab.research.google.com/github/argonne-lcf/ai-science-training-series/blob/main/04_intro_to_llms/IntroLLMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
::: {.callout-note title="Authors" collapse="false"}
Content in this notebook is modified from content originally written by:
- Archit Vasan, Huihuo Zheng, Marieme Ngom, Bethany Lusch, Taylor Childers,
Venkat Vishwanath
Inspiration from the blog posts "The Illustrated Transformer" and "The
Illustrated GPT2" by Jay Alammar, highly recommended reading.
:::
Although the name "language models" is derived from Natural Language
Processing, the models used in these approaches can be applied to diverse
scientific applications as illustrated below.
## Outline
During this session I will cover:
1. Scientific applications for language models
2. General overview of Transformers
3. Tokenization
4. Model Architecture
5. Pipeline using HuggingFace
6. Model loading
## Modeling Sequential Data
Sequences are variable-length lists of data in which each element (or token)
depends on the elements that came before it.
Mathematically:
A sequence is a list of tokens:
$$T = [t_1, t_2, t_3,...,t_N]$$
where each token depends on the tokens that precede it with a particular
conditional probability:
$$P(t_N | t_{N-1}, ..., t_3, t_2, t_1)$$
The purpose of sequential modeling is to learn these probabilities over the
possible tokens in order to perform tasks such as:
- Sequence generation based on a prompt
- Language translation (e.g. English --> French)
- Property prediction (predicting a property based on an entire sequence)
- Identifying mistakes or missing elements in sequential data
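Applying the conditional probability above at every position gives the chain rule for the probability of a whole sequence:

$$P(t_1, t_2, \ldots, t_N) = \prod_{i=1}^{N} P(t_i \mid t_{i-1}, \ldots, t_1)$$

As a minimal sketch, the snippet below assumes a hypothetical toy *bigram* model, in which each token depends only on the token immediately before it, with made-up probabilities. It scores a sequence with the chain rule (summing log-probabilities to avoid numerical underflow) and generates a continuation by greedily picking the most probable next token. Real language models condition on the full preceding context and learn these probabilities from data.

```python
import math

# Hypothetical toy bigram model: BIGRAMS[prev][next] = P(next | prev).
# The probabilities are made up for illustration; "<s>" marks the start of a sequence.
BIGRAMS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.3, "dog": 0.7},
    "cat": {"sat": 0.8, "ran": 0.2},
    "dog": {"sat": 0.4, "ran": 0.6},
}


def sequence_probability(tokens):
    """Chain rule under the bigram assumption: multiply P(t_i | t_{i-1}) over the sequence."""
    log_p = 0.0
    prev = "<s>"
    for tok in tokens:
        log_p += math.log(BIGRAMS[prev][tok])
        prev = tok
    return math.exp(log_p)


def greedy_generate(prompt, steps):
    """Extend a prompt by repeatedly appending the most probable next token."""
    tokens = list(prompt)
    for _ in range(steps):
        options = BIGRAMS.get(tokens[-1] if tokens else "<s>")
        if not options:  # the model knows no continuation for this token
            break
        tokens.append(max(options, key=options.get))
    return tokens


print(sequence_probability(["the", "cat", "sat"]))  # ≈ 0.24 = 0.6 * 0.5 * 0.8
print(greedy_generate(["a"], 5))  # ['a', 'dog', 'ran'] (stops early: no continuation for "ran")
```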
## Scientific sequential data modeling examples
### Nucleic acid sequences \+ genomic data
::: {#fig-RNA-codons}

RNA Codons
:::
Nucleic acid sequences can be used to predict translation of proteins, mutations, and gene expression levels.
Here is an image of GenSLM, a language model developed by Argonne researchers that can model genomic information in a single model. It was shown to model the evolution of SARS-CoV-2 without expensive experiments.
::: {#fig-genslm}

Genomic Scale Language Models (GenSLM)
[Zvyagin et al. 2022. bioRxiv](https://www.biorxiv.org/content/10.1101/2022.10.10.511571v1)
:::
### Protein sequences
Protein sequences can be used to predict folding structure, protein-protein interactions, chemical/binding properties, protein function and many more properties.
::: {#fig-protein-structure}

Protein Structure
:::
::: {#fig-esmfold}

ESMFold
[Lin et al. 2023. Science](https://www.science.org/doi/10.1126/science.ade2574)
:::
### Other applications
* Biomedical text
* SMILES strings
* Weather predictions
* Interfacing with simulations such as molecular dynamics simulation
## Overview of Language models
We will now briefly talk about the progression of language models.
### Transformers
The most common LMs are based on the Transformer architecture, which was introduced in 2017 in the paper "Attention Is All You Need".
::: {#fig-attention-is-all-you-need}

Attention is all you need
[Vaswani et al. 2017. Advances in Neural Information Processing Systems](https://arxiv.org/pdf/1706.03762)
:::
Since then a multitude of LLM architectures have been designed.
::: {#fig-ch1-transformers}

Transformers, chronologically
:::
[HuggingFace NLP Course](https://huggingface.co/learn/nlp-course/chapter1/4)
## Coding example of LLMs in action!
Let's look at an example of running inference with an LLM as a black box to
generate text given a prompt.
Here, we will use the `transformers` library, which is part of HuggingFace, a
repository of models, tokenizers, and information on how to apply these
models.
::: {.callout-warning collapse="false" title="🦜 Stochastic Parrots"}
**Warning**: _Large Language Models are only as good as their training data_.
They have no ethics, judgement, or editing ability.
We will be using some pretrained models from Hugging Face that were trained on
wide samples of internet-hosted text.
The datasets have not been strictly filtered to remove all malign content, so
the generated text may be surprisingly dark or questionable.
They do not reflect our core values and are only used for demonstration
purposes.
:::
```{python}
#| colab: {base_uri: https://localhost:8080/, height: 35}
'''
Uncomment the section below if running in a Jupyter notebook on Sophia
'''
# import os
# os.environ["HTTP_PROXY"]="proxy.alcf.anl.gov:3128"
# os.environ["HTTPS_PROXY"]="proxy.alcf.anl.gov:3128"
# os.environ["http_proxy"]="proxy.alcf.anl.gov:3128"
# os.environ["https_proxy"]="proxy.alcf.anl.gov:3128"
# os.environ["ftp_proxy"]="proxy.alcf.anl.gov:3128"
```
```{python}
#| colab: {base_uri: https://localhost:8080/}
#| output: false
!pip install transformers
!pip install pandas
!pip install torch
```
```{python}
import ambivalent
import matplotlib.pyplot as plt
import seaborn as sns
import ezpz
console = ezpz.log.get_console()
plt.style.use(ambivalent.STYLES['ambivalent'])
sns.set_context("notebook")
plt.rcParams["figure.figsize"] = [6.4, 4.8]
```
```{python}
#| colab: {base_uri: https://localhost:8080/}
from transformers import pipeline

input_text = "My dog really wanted to eat ice cream because"
# Build a text-generation pipeline around the pretrained GPT-2 model
generator = pipeline("text-generation", model="gpt2")
# Sample 5 different continuations, each up to 20 new tokens long
# (max_length would also count the prompt's tokens)
generator(input_text, max_new_tokens=20, num_return_sequences=5, do_sample=True)
```
## What's going on under the hood?
There are two components that are "black boxes" here:
1. The method for tokenization
2. The model that generates novel text.
## Tokenization and embedding of sequential data
Humans can inherently understand language data because they previously learned phonetic sounds.
Machines don’t have phonetic knowledge so they need to be told how to break text into standard units to process it.
They use a system called “tokenization”, where sequences of text are broken into smaller parts, or “tokens”, and then fed as input.
<div>
<img src="https://github.com/architvasan/ai_science_local/blob/main/images/text-processing---machines-vs-humans.png?raw=1" width="400"/>
</div>
Tokenization is a data preprocessing step which transforms the raw text data into a format suitable for machine learning models. Tokenizers break down raw text into smaller units called tokens. These tokens are what is fed into the language models. Based on the type and configuration of the tokenizer, these tokens can be words, subwords, or characters.
Types of tokenizers:
1. Character Tokenizers: Split text into individual characters.
2. Word Tokenizers: Split text into words based on whitespace or punctuation.
3. Subword Tokenizers: Split text into subword units, such as morphemes or character n-grams. Common subword tokenization algorithms include:
    1. Byte-Pair Encoding (BPE)
    2. SentencePiece
    3. WordPiece
<div>
<img src="https://github.com/architvasan/ai_science_local/blob/main/images/tokenization_image.webp?raw=1" width="400"/>
</div>
[nlpiation](https://nlpiation.medium.com/how-to-use-huggingfaces-transformers-pre-trained-tokenizers-e029e8d6d1fa)
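The three tokenizer families can be contrasted in a minimal plain-Python sketch. The character and word rules below are simplified assumptions, and the BPE part only illustrates its core idea, learning merges by repeatedly fusing the most frequent adjacent pair of symbols, on a tiny hypothetical corpus. Production tokenizers such as GPT-2's operate on bytes and handle many more details.

```python
import re
from collections import Counter


def char_tokenize(text):
    """Character tokenizer: every character is a token."""
    return list(text)


def word_tokenize(text):
    """Word tokenizer: runs of word characters, with punctuation as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the fused symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Toy BPE training: start from characters and greedily merge the most frequent pair."""
    corpus = Counter(tuple(w) for w in words)  # word as a tuple of symbols -> frequency
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def apply_bpe(word, merges):
    """Tokenize a word by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


print(char_tokenize("a cat"))          # ['a', ' ', 'c', 'a', 't']
print(word_tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
merges = learn_bpe_merges(["hug", "hug", "pug", "hugs"], num_merges=2)
print(merges)                          # [('u', 'g'), ('h', 'ug')]
print(apply_bpe("hugs", merges))       # ['hug', 's']
print(apply_bpe("pugs", merges))       # ['p', 'ug', 's']
```

Note how the unseen word "pugs" is still tokenized, into known subword pieces, which is the main advantage of subword tokenizers over word tokenizers.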
### Example of tokenization
Let's look at an example of tokenization using byte-pair encoding.
```{python}
#| colab: {base_uri: https://localhost:8080/}
from transformers import AutoTokenizer


# A utility function to tokenize a sequence and log some information about it.
def tokenization_summary(tokenizer, sequence):
    # Get the vocabulary (a dict mapping token string -> integer id)
    vocab = tokenizer.vocab
    # Number of vocabulary entries to log
    n = 10
    # Log a subset of the vocabulary
    console.log("Subset of tokenizer.vocab:")
    for i, (token, index) in enumerate(vocab.items()):
        console.log(f"{token}: {index}")
        if i >= n - 1:
            break
    console.log("Vocab size of the tokenizer = ", len(vocab))
    console.log("------------------------------------------")
    # .tokenize chunks the sequence into tokens based on the rules and vocab of the tokenizer.
    tokens = tokenizer.tokenize(sequence)
    console.log("Tokens : ", tokens)
    console.log("------------------------------------------")
    # .convert_tokens_to_ids maps each token 1-to-1 to its numerical id.
    # ids = tokenizer.convert_tokens_to_ids(tokens)
    # console.log("encoded Ids: ", ids)
    # .encode tokenizes and converts to ids in one step, adding any special tokens the
    # tokenizer defines (e.g. [CLS]/[SEP] for BERT; GPT-2 adds none by default).
    console.log("tokenized sequence : ", tokenizer.encode(sequence))
    # Calling the tokenizer directly also returns the attention_mask.
    # encode = tokenizer(sequence)
    # console.log("Encode sequence : ", encode)
    # console.log("------------------------------------------")
    # .decode converts the ids back to raw text
    ids = tokenizer.convert_tokens_to_ids(tokens)
    decode = tokenizer.decode(ids)
    console.log("Decode sequence : ", decode)


tokenizer_1 = AutoTokenizer.from_pretrained("gpt2")  # GPT-2 uses byte-level Byte-Pair Encoding (BPE)
sequence = "Counselor, please adjust your Zoom filter to appear as a human, rather than as a cat"
tokenization_summary(tokenizer_1, sequence)
```
### Token embedding
Tokens are turned into vectors based on their index (token ID) in the vocabulary.
The strategy of choice for learning language structure from tokenized text is to find a clever way to map each token into a moderate-dimension vector space, adjusting the mapping so that
similar or associated tokens take up residence near each other, and different regions of the space correspond to different positions in the sequence.
Such a mapping from token ID to a point in a vector space is called a token embedding. The dimension of the vector space is often high (e.g. 1024-dimensional), but much smaller than the vocabulary size (30,000--500,000).
Various approaches have been attempted for generating such embeddings, including static algorithms that operate on a corpus of tokenized data as preprocessors for NLP tasks. Transformers, however, adjust their embeddings during training.
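Mechanically, an embedding layer is a lookup table: a matrix with one row per vocabulary entry, indexed by token ID. Below is a minimal NumPy sketch, assuming toy sizes and random values in place of learned weights (for comparison, GPT-2 small uses a 50,257 × 768 token-embedding matrix). It also shows that the lookup is equivalent to multiplying a one-hot vector by the embedding matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 50, 8  # toy sizes chosen for illustration

# Embedding matrix: row i is the vector for token ID i.
# Random here; in a Transformer these values are learned during training.
embedding = rng.normal(size=(vocab_size, d_model))

token_ids = np.array([3, 17, 3])  # hypothetical token IDs for a 3-token sequence
vectors = embedding[token_ids]    # lookup = row indexing, shape (3, d_model)

# Equivalent formulation: one-hot vectors times the embedding matrix
one_hot = np.eye(vocab_size)[token_ids]  # shape (3, vocab_size)
assert np.allclose(one_hot @ embedding, vectors)


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


print(vectors.shape)                              # (3, 8)
print(cosine_similarity(vectors[0], vectors[2]))  # ≈ 1.0: the same token ID maps to the same vector
print(cosine_similarity(vectors[0], vectors[1]))  # some value in [-1, 1]; training pulls related tokens closer
```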
## Transformer Model Architecture
Now let's look at the base elements that make up a Transformer by dissecting
the popular GPT2 model.
```{python}
#| colab: {base_uri: https://localhost:8080/}
from transformers import GPT2Tokenizer, GPT2LMHeadModel
model = GPT2LMHeadModel.from_pretrained('gpt2')
console.log(model)
```
GPT2 is an example of a Transformer Decoder which is used to generate novel text.
Decoder models use only the decoder of a Transformer model. At each stage, for a given word the attention layers can only access the words positioned before it in the sentence. These models are often called auto-regressive models. The pretraining of decoder models usually revolves around predicting the next word in the sentence.
These models are best suited for tasks involving text generation.
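The decoder-only training setup described above can be sketched in a few lines: the targets are the inputs shifted by one position, and a lower-triangular (causal) mask records which positions each token may attend to. This is a minimal illustration with a made-up sentence, split on whitespace for simplicity rather than with a real tokenizer.

```python
import numpy as np

tokens = "the cat sat on the mat".split()  # whitespace split, for illustration only

# Next-token prediction: at position i the model sees inputs[: i + 1] and must predict targets[i]
inputs, targets = tokens[:-1], tokens[1:]

# Causal mask: entry [i, j] is True when position i may attend to position j (j <= i)
n = len(inputs)
causal_mask = np.tril(np.ones((n, n), dtype=bool))

for i in range(n):
    visible = [tok for tok, allowed in zip(inputs, causal_mask[i]) if allowed]
    print(f"context={visible!r:45} -> predict {targets[i]!r}")

print(causal_mask.astype(int))
```

Because every position is trained in parallel with its own masked context, a single sentence yields one training example per token.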
The architecture of GPT-2 is inspired by the paper: "Generating Wikipedia by Summarizing Long Sequences" which is another arrangement of the transformer block that can do language modeling. This model threw away the encoder and thus is known as the “Transformer-Decoder”.
<div>
<img src="https://github.com/architvasan/ai_science_local/blob/main/images/transformer-decoder-intro.png?raw=1" width="500"/>
</div>
[Illustrated GPT2](https://jalammar.github.io/illustrated-gpt2/)
Key components of the transformer architecture include:
* Input Embeddings: Word embeddings (word vectors) represent words or text as numeric vectors, where words with similar meanings have similar representations.
* Positional Encoding: Injects information about the position of words in a sequence, helping the model understand word order.
* Self-Attention Mechanism: Allows the model to weigh the importance of different words in a sentence, enabling it to effectively capture contextual information.
* Feedforward Neural Networks: Process information from self-attention layers to generate output for each word/token.
* Layer Normalization and Residual Connections: Aid in stabilizing training and mitigating the vanishing gradient problem.
* Transformer Blocks: Comprised of multiple layers of self-attention and feedforward neural networks, stacked together to form the model.
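Several of these components fit in a short NumPy sketch. It assumes the sinusoidal positional encoding from "Attention Is All You Need" (GPT-2 instead learns its position embeddings), a layer norm without the learnable scale and shift, a ReLU feed-forward network (GPT-2 uses GELU), and random weights in place of trained ones. Like GPT-2, it applies layer normalization at the input of the sub-block and adds the result back through a residual connection.

```python
import numpy as np


def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2), the values 2i
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe


def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance (no learnable scale/shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def feed_forward(x, w1, b1, w2, b2):
    """Position-wise two-layer MLP applied independently to every token vector."""
    hidden = np.maximum(0.0, x @ w1 + b1)  # ReLU
    return hidden @ w2 + b2


rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32  # toy sizes (d_model must be even here)
token_embeddings = rng.normal(size=(seq_len, d_model))
w1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)

# Input embeddings + positional encoding: the same token at a different position gets a different vector
x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
# Pre-norm residual sub-block: x + FFN(LayerNorm(x)); the shape is preserved
out = x + feed_forward(layer_norm(x), w1, b1, w2, b2)
print(out.shape)  # (4, 8)
```

Because every sub-block maps a `(seq_len, d_model)` array to one of the same shape, Transformer blocks can be stacked as deep as desired.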
### Attention mechanisms
Since attention mechanisms are arguably the most powerful component of the Transformer, let's discuss this in a little more detail.
Suppose the following sentence is an input sentence we want to translate using an LLM:
`"The animal didn't cross the street because it was too tired"`
To understand a full sentence, the model needs to understand what each word means in relation to other words.
For example, when we read the sentence:
`"The animal didn't cross the street because it was too tired"`
we know intuitively that the word `"it"` refers to `"animal"`, the state for `"it"` is `"tired"`, and the associated action is `"didn't cross"`.
However, the model needs a way to learn all of this information in a simple yet generalizable way.
What makes Transformers particularly powerful compared to earlier sequential architectures is how they encode context with the **self-attention mechanism**.
As the model processes each word in the input sequence, attention looks at other positions in the input sequence for clues that lead to a better encoding of this word.
<div>
<img src="https://github.com/architvasan/ai_science_local/blob/main/images/transformer_self-attention_visualization.png?raw=1" width="400"/>
</div>
[The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/)
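Under the hood, each attention layer computes scaled dot-product attention. Every token vector is projected into a query $Q$, a key $K$, and a value $V$; each token's query is compared against the keys of all tokens, and the resulting weights mix the values:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Below is a minimal NumPy sketch of single-head self-attention, assuming random projection matrices in place of learned ones and toy dimensions. The optional `mask` argument is where a decoder such as GPT-2 would pass its causal mask, so that each token only attends to itself and earlier positions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax (subtract the max before exponentiating)."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k: (seq_len, d_k); v: (seq_len, d_v). Returns (output, attention weights)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)  # (seq_len, seq_len) similarity scores
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)  # disallowed positions get zero weight
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8
x = rng.normal(size=(seq_len, d_model))  # stand-in for token embeddings
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
out, weights = scaled_dot_product_attention(x @ w_q, x @ w_k, x @ w_v, mask=causal)
print(out.shape)             # (5, 8)
print(np.round(weights, 2))  # lower-triangular: no weight on future tokens
```

Each row of `weights` is what BertViz draws as the lines from one token to the others.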
#### Multi-head attention
In practice, multiple attention heads are used simultaneously.
This:
* Expands the model’s ability to focus on different positions.
* Prevents the attention from being dominated by the word itself.
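Multi-head attention runs several smaller attention computations in parallel: the model dimension is split into `num_heads` slices, each head attends independently using its own slice of the projections, and the heads' outputs are concatenated and projected back. Below is a minimal NumPy sketch, assuming random weights, no mask, and toy sizes (GPT-2 small, for reference, uses 12 heads of size 64).

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); all weight matrices: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)  # one attention pattern per head
    heads = weights @ v                 # (num_heads, seq_len, d_head)
    # Concatenate the heads back to (seq_len, d_model), then mix them with the output projection
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights


rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 6, 16, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out, weights = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)      # (6, 16)
print(weights.shape)  # (4, 6, 6): each head has its own attention pattern
```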
#### Let's see multi-head attention mechanisms in action!
We are going to use BertViz, a powerful visualization tool that provides an interactive view of the attention mechanisms. These mechanisms are normally abstracted away, but BertViz will allow us to inspect our model in more detail.
```{python}
#| colab: {base_uri: https://localhost:8080/}
#| output: false
!pip install bertviz
```
Let's load the GPT2 model and look at its attention mechanisms.
**Hint... click on the different blocks in the visualization to see the attention**
```{python}
#| colab: {base_uri: https://localhost:8080/, height: 907, referenced_widgets: [ad748b4c592645cb90d427c9012d7368, a34cdc4592cf46ad9ccf69caa52363bd, c1eac1886ea84c87bed161c960569d3d, ca50498e237843709d8c44e7ea0e94b5, 8f4387f0bb3e402ab1ab4d443c693f96, 47316e9804804c8d9c9b031fa3d86052, 6cc5bd956c694bc5b63c0c7bba2a29f8, 93728bade1714bcabbe90f1b9b95b7c2, 83cea646339b4e74a9582827afdf5753, c765d253aa974bc59654f0398afb964d, 3d71478078bc4743a580dc6200cccbd0, 099e1574f4a14c308be2364d0bba2a76, 6f822f4ae7734ce09736d50bf17ad99e, 93fbd0ce38df488fa725d7c584344e6b, ee56c1c4c2be4086883a95e15adc655d, 846bb66c196540febf6a7bc32b60dc3a, 37dcd77ea3154208be44bfa35a88fa90, fdd9ba7947794f4aa0e2782e3b379f61, 64fa9056ffef4ab3a4e2db085a48cbfb, acee154ece2e4a939f42236ec38de871, 46acb20f74174029962ecb8142c576fb, b58662cfa5fb4406a171135e01d25368, c770f4a4cad0435ebedbb932c1431f52, 14dab4fee3d34783bf838cfbd4bdbcaa, d00cd656bc4f4f0d9cf52b942fceb155, d5fbc43ad5204bb88b11ddceaedc5d6d, 8d825e3d51524720ada2f1ecd9a9ffd5, 9a17934807c140eeb8c0a5cf80c4cb50, 7d2254be42234d24b044f6c9c0d49119, 9eddddc3c9034e6da5bf612da55f5714, 9548afdd0d154788ba3edb96c326c85f, 2213cf9588c545f2b200186766ac425d, d56ccfc975d949b7a6cbe7b9a89a1101, ff3907fe32a14ae681a6a01933e6eaf5, f08eac6bbad843ca95157d035c1ef043, 7e8af1c44d984d99a2b27e80c05cec75, 0a381955a75148d5a59815efb2b352bd, 7ac42549074746b48926a9cffe0a542c, 55b65d9fa1e44149bfc763a8c99e7c2a, cd7ca2bbf9b04de4842b316ea7a5535f, 0c0b4f2c4ef14413bc24340afc040431, 6442fa46c934428998cf8e3d5835e8bc, 8da520e3f829425b83e9b92149f792ad, 78c5836dbb6b4538abcc3cd348c5b958, 3f9fa0a2a0614bae87f7c698d043a3b5, 6c5fb86539584d6dad2af92729b9a1ef, 1eb460d0dad44857a14646782366c7c1, 67c1873c51d34503833366af483744a3, 713fef79f5d441b18f372c64d9569551, 4b5d59490f264d03a9ac52897ae94d33, 42519c4cca864ad78595214e86079c8f, bd3444ba231347378ae1a7ffa4181648, d701e46a5472481ea942b613e3506f03, e2effa7f78034f5c98a2c7f2d12b7ec3, 8ee1c3f4f84f49b9aa78590d20e03005, 00d232fe28f74a1ea9fad0510586ff9c, e2e8c12232854688802988193dd6bdc8, 3129da774c2c4298afbf3d766ca7f428, 0d36758d65ed404e8181bdc4e2a9d26b, 3e75401bc3bf42da8f9e1a377401b69d, fbd25c280a554d50ace7384d5939e335, 3371add482984a3bab5c31379484af71, 001363aab0a143a683a07e48cbbf5594, c87375d637f74dc4ab4bab00852c758c, b5f27258ef4a4b9e8b7d13387b14d9a1, d201d648b932443ca9141aa9129e5661, 0a2f04862b3a4c36becf5949c67a14a5, 8f57a09202cf4cddb2ba697ecfc78daf, a9fae2f4af744d33bddeeb48420cda3e, 2994daa8857945509c97e3f2cdb4ee97, 593dfccfa5ca4e56a32e9cead70714d2, 5f3d428e20334a3da4c82d0fc2e3eb3e, 5e028ae3b1824c74a0d610129dcc8246, 28ba4c7a70294e589cb2ced5766072e4, d3a9c7752afb49a5bb1ea70bdb8744fe, 09eed5891aa647da9f7c9beceac50622, 4868e95bf5ec4cecab53e5e65862aea6]}
from transformers import AutoTokenizer, AutoModel, utils, AutoModelForCausalLM
from bertviz import model_view
utils.logging.set_verbosity_error() # Suppress standard warnings
model_name = 'openai-community/gpt2'
input_text = "No, I am your father"
model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer.encode(input_text, return_tensors='pt') # Tokenize input text
outputs = model(inputs) # Run model
attention = outputs.attentions  # Tuple with one attention tensor per layer
tokens = tokenizer.convert_ids_to_tokens(inputs[0]) # Convert input ids to token strings
model_view(attention, tokens) # Display model view
```
## Pipeline using HuggingFace
Now, let's see a practical application of LLMs using a HuggingFace pipeline for classification.
This involves a few steps including:
1. Setting up a prompt
2. Loading in a pretrained model
3. Loading in the tokenizer and tokenizing input text
4. Performing model inference
5. Interpreting inference output
```{python}
# STEP 0 : Installations and imports
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig
import torch
import torch.nn.functional as F
```
### 1. Setting up a prompt
A "prompt" refers to a specific input or query provided to a language model. They guide the text processing and generation by providing the context for the model to generate coherent and relevant text based on the given input.
The choice and structure of the prompt depend on the specific task, the context, and the desired output. Prompts can be "discrete" or "instructive", where they are explicit instructions or questions directed to the language model. They can also be more nuanced, providing suggestions, directions, and context to the model.
We will use very simple prompts in this tutorial section, but we will learn more about prompt engineering and how it helps in optimizing the performance of the model for a given use case in the following tutorials.
```{python}
# STEP 1 : Set up the prompt
input_text = "The panoramic view of the ocean was breathtaking."
```
### 2. Loading Pretrained Models
The `from_pretrained()` method of `AutoModelForSequenceClassification` instantiates a sequence classification model.
Refer to https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#automodels for the list of supported model classes.
The `from_pretrained` method downloads the pretrained weights from the Hugging Face Model Hub (or the specified URL) if the model is not already cached locally. It then loads the weights into the instantiated model, initializing the model parameters with the pretrained values.
The model cache contains:
* model configuration (config.json)
* pretrained model weights (model.safetensors)
* tokenizer information (tokenizer.json, vocab.json, merges.txt, tokenizer.model)
```{python}
#| colab: {base_uri: https://localhost:8080/, height: 671, referenced_widgets: [bd858574deec4181bb0ba443be9a2088, b025f1fa9b61436f93515742e324b86f, 5073ff96876f4e04b246e466be18c186, 0c9e9d1ae5ed4ae4a7e37513d35f9c77, b29ed3c3e599460cbd3d693d71f456d2, 5b43c22a0fa94a7690d637dcef8f9e0a, 51736c335dd349e7bfcc529ada3c6f6a, 4791ebe7b3cc4444afdcbfd8d1b5967a, 93c49acd409c4616b22b28ef971c9dd6, 4777bd67c1e64692b294830f05c7143d, aeb6c3883112462c8ade793f34ebc4da, 82ebac62b0954db1b9fa3e3ae4025a04, 88bc6af07dc244b8bb3ea694ed31d2ae, 21364965ba974886ad36c5d9670e7a70, 00c33e96a50b4993a9f3387402fbccfe, 9de7b337c654426fac8f86b97c7ca7dd, d3d454f41f164fc5af66adf8d2caa53c, 06cd5703c7134fd49a66a259d0154b85, f6131609549c4e7eb89cbb3f430d8aca, e5a7799affd54336b99042ae28609630, 5ee4ce555b264be4becae07722f88a25, 4e181643cf4e449ca06776ecbba79c5e]}
# STEP 2 : Load the pretrained model.
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)
console.log(config)
```
### 3. Loading in the tokenizer and tokenizing input text
Here, we load in a pretrained tokenizer associated with this model.
```{python}
#| colab: {base_uri: https://localhost:8080/, height: 116, referenced_widgets: [067f49dd461a44fabf2b2fbc48c38ea3, 8081fe15a05f4d93a263016d65f7c867, db4f4633a272445fb65b31b81cc13a47, 683187173eee4b4f84c42ee1b030041e, 9b356d4bb2584404b40a53e84ffd08ad, f8844d328728454cabd0c601f1dfac4c, 999973bfe8554a6b89b2214b39348331, 80f90f7e6d644d1d8b7df13c9e85225d, 683ed139b6384e78a7d16d8375640c74, aaca325b53f449519e4193981a4b3865, d1a67e3364a04697a740fac83669d9f9, 9489672461b54621bdb8bb24d0eb9934, 2862c54dff8c4f83829e946c69962221, 95367cef20634367b463b8f0a95e0503, fac8d31cc40e4c9e80f8ca93a9075077, e61474e5f4784cf6ab8a47decc035ae0, b28301ee0a554b009b4a04ab5c3bcaf0, 8a29da2807084308bd8ccf9414150e31, 545d6ab11dc744099b608face7404d0e, 6c4a6d9277f148c6a575b1b7c6c008f7, cac928cad64b40cb8d0ef32f3731451a, a0e7b739e9a44660b0b8b65448d89a93]}
#STEP 3 : Load the tokenizer and tokenize the input text
tokenizer = AutoTokenizer.from_pretrained(model_name)
input_ids = tokenizer(input_text, return_tensors="pt")["input_ids"]
console.log(input_ids)
```
### 4. Performing inference and interpreting
Here, we:
* load the tokenized input into the model,
* perform inference to obtain logits,
* convert the logits into probabilities, and
* assign a label according to the probabilities.
The end result is that we can predict whether the input phrase is positive or negative.
```{python}
#| colab: {base_uri: https://localhost:8080/}
# STEP 4 : Perform inference (no gradients are needed, so disable tracking)
with torch.no_grad():
    outputs = model(input_ids)
result = outputs.logits
console.log(result)
# STEP 5 : Interpret the output.
probabilities = F.softmax(result, dim=-1)
console.log(probabilities)
predicted_class = torch.argmax(probabilities, dim=-1).item()
labels = ["NEGATIVE", "POSITIVE"]
prediction = [{"label": labels[predicted_class], "score": probabilities[0][predicted_class].item()}]
console.log(prediction)
```
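For reference, the conversion performed above by `F.softmax` and `torch.argmax` can be written out by hand. Below is a minimal NumPy sketch, assuming made-up logits for the two classes in the `["NEGATIVE", "POSITIVE"]` order used above; the numbers the model actually produces will differ.

```python
import numpy as np

labels = ["NEGATIVE", "POSITIVE"]
logits = np.array([-4.0, 4.0])  # hypothetical logits for one input sentence

# Softmax: exponentiate and normalize; subtracting the max avoids overflow without changing the result
exp = np.exp(logits - logits.max())
probabilities = exp / exp.sum()

predicted_class = int(np.argmax(probabilities))
result = {"label": labels[predicted_class], "score": float(probabilities[predicted_class])}
print(result)  # label 'POSITIVE', score ≈ 0.9997
```

The Hugging Face `pipeline("sentiment-analysis", model=model_name)` helper bundles tokenization, inference, and this post-processing, and returns results in the same `[{'label': ..., 'score': ...}]` shape.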
### Saving and loading models
Models can be saved to and loaded from a local model directory.
```{python}
#| colab: {base_uri: https://localhost:8080/, height: 81, referenced_widgets: [5c584862c1714236be9a4d72eacb7e8c, 878ac7474c814c56a1f8e4ca1307b446, 61d31a811e704e0b940337009892c774, 90d95ae3318b4f4bbda0f5d016347e91, c603255a544f47b280fd61e9e45d9204, bc69805bf372491d931e20940cf99cfc, 9eb1cc51fc0348069875ce5c52c992b9, a36bbfb2408944c39232013f53811f08, 16994401181f43238336a07a06be119b, dc6892bac378490abd7f628bca35906c, 1dcd4ce0732240aab355557623a19d59, 3933d404330d4d02913a7e075819f807, f286b77c68ce47ec9a5177f2dd35ea48, d15d08eb7374438f8edbcc2d78beafc8, a9a334e404fe47da819a7e97d0dbf958, ed261fb3eb6b4661a0dba67a3fc9858b, 95d2b19696f44d099e530f82ad213328, 725a53042a724dc0a9a658bd3dd3cb83, e4dd55f635a74dbeb0067d1c47769123, 832adb3112224a51ba9376a50076dd8c, c953196990624b0ea79c2a3c74cd1de9, 46f89e0a5ae84c6c97265c795fae9f88]}
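from transformers import AutoModelForMaskedLM, AutoTokenizer
# Instantiate a model and its tokenizer (bert-base-uncased is a masked language model)
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Train or fine-tune the model...
# Save the model and tokenizer to a local directory
directory = "my_local_model"
model.save_pretrained(directory)
tokenizer.save_pretrained(directory)
# Load them back from the local directory, using the same model class that was saved
loaded_model = AutoModelForMaskedLM.from_pretrained(directory)
loaded_tokenizer = AutoTokenizer.from_pretrained(directory)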
```
## Model Hub
The Model Hub is where the members of the Hugging Face community can host all of their model checkpoints for simple storage, discovery, and sharing.
* Download pretrained models with the `huggingface_hub` client library, or with Transformers for fine-tuning.
* Use the Inference API to run models in production settings.
* Filter models by task, framework, dataset, and more.
* Selecting any model shows its model card.
* The model card contains information about the model, including its description, usage, and limitations. Some models also provide an Inference API that can be used directly.
Model Hub Link : https://huggingface.co/docs/hub/en/models-the-hub
Example of a model card : https://huggingface.co/bert-base-uncased
## Recommended reading
* ["The Illustrated Transformer" by Jay Alammar](https://jalammar.github.io/illustrated-transformer/)
* ["Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)" by Jay Alammar](https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/)
* ["The Illustrated GPT-2 (Visualizing Transformer Language Models)"](https://jalammar.github.io/illustrated-gpt2/)
* ["A gentle introduction to positional encoding"](https://machinelearningmastery.com/a-gentle-introduction-to-positional-encoding-in-transformer-models-part-1/)
* ["LLM Tutorial Workshop (Argonne National Laboratory)"](https://github.com/brettin/llm_tutorial)
* ["LLM Tutorial Workshop Part 2 (Argonne National Laboratory)"](https://github.com/argonne-lcf/llm-workshop)
## Homework
1. Load in a generative model using the HuggingFace pipeline and generate text using a batch of prompts.
    * Play with generative parameters such as temperature, max_new_tokens, and the model itself, and explain the effect on the legibility of the model response. Try at least 4 different parameter/model combinations.
    * Models that can be used include:
        * `google/gemma-2-2b-it`
        * `microsoft/Phi-3-mini-4k-instruct`
        * `meta-llama/Llama-3.2-1B`
        * Any model from this list: [Text-generation models](https://huggingface.co/models?pipeline_tag=text-generation)
        * `gpt2` if having trouble loading these models in
    * This guide should help! [Text-generation strategies](https://huggingface.co/docs/transformers/en/generation_strategies)
2. Load in 2 models of different parameter sizes (e.g. GPT2, meta-llama/Llama-2-7b-chat-hf, or distilbert/distilgpt2) and analyze the BertViz output for each. How do the attention mechanisms change depending on model size?
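As background for the homework, here is a minimal sketch of what temperature and top-k sampling do to a next-token distribution. It uses NumPy with made-up logits over a hypothetical five-token vocabulary and is not the Hugging Face `generate` implementation. Dividing the logits by a temperature below 1 sharpens the distribution (more predictable text), a temperature above 1 flattens it (more varied and eventually less legible text), and top-k keeps only the k highest-scoring tokens before sampling.

```python
import numpy as np


def next_token_distribution(logits, temperature=1.0, top_k=None):
    """Turn raw logits into sampling probabilities using temperature and optional top-k."""
    scaled = np.asarray(logits, dtype=float) / temperature
    if top_k is not None:
        cutoff = np.sort(scaled)[-top_k]  # k-th largest scaled logit
        scaled = np.where(scaled >= cutoff, scaled, -np.inf)  # drop everything below it (ties are kept)
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()


vocab = ["the", "a", "cat", "dog", "quantum"]  # hypothetical vocabulary
logits = [2.0, 1.5, 1.0, 0.5, -1.0]            # made-up model scores

for t in (0.5, 1.0, 2.0):
    print(t, np.round(next_token_distribution(logits, temperature=t), 3))
print("top_k=2", np.round(next_token_distribution(logits, top_k=2), 3))

rng = np.random.default_rng(0)
probs = next_token_distribution(logits, temperature=0.7, top_k=3)
sampled = vocab[rng.choice(len(vocab), p=probs)]
print("sampled:", sampled)
```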